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Abstract 

We implemented a GPU based parallel code to perform Monte Carlo simu- 
lations of the two dimensional q-state Potts model. The algorithm is based 
on a checkerboard update scheme and assigns independent random numbers 
generators to each thread. The implementation allows to simulate systems 
up to ~ 10^ spins with an average time per spin flip of 0.147ns on the fastest 
GPU card tested, representing a speedup up to 155x, compared with an 
optimized serial code running on a high-end CPU. 

The possibility of performing high speed simulations at large enough sys- 
tem sizes allowed us to provide a positive numerical evidence about the ex- 
istence of metastability on very large systems based on Binder's criterion, 
namely, on the existence or not of specific heat singularities at spinodal tem- 
peratures different of the transition one. 
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1. Introduction 

The tremendous advances allowed by the usage of numerical simulations 
in the last decades have promoted these techniques to the status of indispens- 
able tools in modern Statistical Mechanics research. Notwithstanding, many 
important theoretical problems in the field still remain difficult to handle due 
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to limitations in the available computational capabilities. Among many oth- 
ers, typical issues that challenge the numerical treatment concern systems 
with slow dynamics (i.e., dynamical processes that involve very different 
time scales) and/or strong finite size effect, which require fast simulations 
of a very large number of particles. Some typical examples we rnay cite are 
spin glass transitions [H, glassy behavior ji], y] and grain growth Q. In such 
kind of problems the state of the art is usually launched by novel numerical 
approaches or extensive computer simulations. In this sense, the advent of 
massive parallel computing continuously opens new possibilities but, at the 
same time, creates a demand for new improved algorithms. In particular, 
the usage of GPU cards (short for Graphics Processing Units) as parallel 
processing devices is emerging as a powerful tool for numerical simulations 



of physics ll2|, [13 



a poweriui t 

B i a i i 



in Statistical Mechanics systems |5|, 16|, 17|, 18|, 19|, 110| , as well as in other areas 



These GPUs have a Toolkit that abstracts the end-user from many low- 
level implementation details, yet all the typical problems of concurrency ex- 
ists and they are magnified by the massive amount of (virtual) threads it is 
capable to handle. An extremely fine grained concurrency is possible and 
advised thanks to the Single Instruction Multiple Thread (SIMT) model. 
Therefore, any non trivially independent problem requires a correct concur- 
rency control (synchronization), and the lack of it hinders correctness in a 
much dramatic way than current 4 or 8-way multicore CPU systems. The 
other challenge apart from correctness is performance, and here is where 
the algorithm design practice excels. Taking into account internal mem- 
ory structure, memory/computation ratio, thread division into blocks and 
thread internal state size, can boost the algorithm performance ten times 
from a trivial implementation It is also customary to give an approx- 
imation of the speedup obtained from a CPU to GPU implementation in 
terms of 'Wx", even though, as we discuss later, this number will always 
depend on the corresponding efforts devoted to optimally programming for 
each architecture. 

In this work we focus on GPU based Statistical Mechanics simulations 
of lattice spin systems. In particular, we study the metastability problem in 



the ferromagnetic g-state Potts model [15[ in two dimensions when g > 4. 
While this phenomenon is clearly observed in finite size systems, its persis- 
tence in the thermodynamics limit is still an unsolved problem and subject 



of debate [ly, ll7|, ll8|, ll9|, |20| . In an earlier work. Binder proposed a numerical 
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criterion to determine whether metastabihty remains in the thermodynamic 
hmit or not, based on the scahng properties of the average energy in the 



vicinity of the transition temperature [16|. However, the narrow range of 
temperature values of the metastable region requires high precision calcula- 
tions for the criterion to work. Hence, to reduce finite size bias and statistical 
errors down to an appropriated level, large enough system sizes are needed. 
The computation capabilities required to carry out such calculations in a 
reasonable time were unavailable until recently. 

We developed an optimized algorithm to perform Monte Carlo numerical 
simulations of the g-state Potts model on GPU cards. This algorithm allowed 
us to simulate systems up to = 32768 x 32768 ~ 1.073 x 10^ spins with a 
lower bound time of 0.147ns per spin flip using using an NVIDIA GTX 480 
Fermi card, and in terms of speedup, we obtained 155x from an optimized 
CPU sequential version running on an Intel Core 2 Duo E8400 at 3.0GHz. 

What is remarkable about the speedup is that it allowed us to explore 
bigger systems, simulate more iterations, explore parameters in a finer way, 
and all of it at a relatively small cost in terms of time, hardware and coding 
effort. With this extremely well performing algorithm we obtained a positive 
numerical evidence of the persistence of metastabihty in the thermodynamic 
limit for g > 4, according to Binder's criterion. 

The paper is structured as follows. In Section [2] we briefly review the 
main properties of the Potts model and the particular physical problem we 
are interested in. In Section |3] we introduce the simulation algorithm and in 
Section H] we compare the predictions of our numerical simulations against 
some known equilibrium properties of the model to validate the code. In 
Section we check the performance of the code. In Section [H] we present our 
numerical results concerning the metastabihty problem. Some discussions 
and conclusions are presented in Section [3 

2. The q-state Potts model 

2.1. The model 



The g-state Potts model [15| without external fields is defined by the 
Hamiltonian 

ff = -J^5(s„.,) (1) 

<i,j> 
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where Sj = l,2,...,g, S{si,Sj) is the Kronecker deha and the sum runs 
over all nearest neighbors pairs of spins in a Bravais lattice with sites. 
Being a generalization of the Ising model {q = 2), this model displays a 
richer behavior than the former. One of the main interests is that the two- 
dimensional ferromagnetic version ( J > 0) exhibit a first order phase transi- 
tion at some finite temperature when q > 4, while for g < 4 the transition 
is continuous 15|. Hence, it has become a paradigmatic model in the study 



of phase transitions and their associated dynamics, like for instance, domain 
growth kinetic s |2ll . |22| . |23| . |24| . |25| and nucleation as an equilibration mech- 



anism 



17|,|26| 



Some equilibrium properties of the two-dimensional model are known 
exactly, which allows numerical algorithms testing. We list here some of 
them that are used for comparison with the numerical results in the present 
work. For instance, the transition temperature for the square lattice in the 



thermodynamic limit is given by [28 



ksTc 



(2) 



J ln(l + 7g) 

where /c^ is the Boltzmann constant. Hereafter we will choose ks/J = 1 



Considering the energy per spin e 
latent heat for g > 4 is [28| 



(H) /N, in the thermodynamic limit the 



1 



e 



6^-60 = 2(1 + ^1 tanh — ]^(tanh nO^ 

^ V / n=l 

where 9 = arccos y/q/2 and 



(3) 



6d = lim ^ lim (H), 

TV— >oo iV T~i-T^ 



Zo = lim — lim (H). 



(4) 
(5) 



Also 



6d + 60 = -2(1 + l/V^) 
from which the individual values of Cd and 60 can be obtained 29 



(6) 
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The order parameter is defined as 



m 



(7) 



where Nmax = max{Ni, N2, . . . , Nq), being Ni the number of spins in state i. 



At the transition the jump in the order parameter (for g > 4) is given by 30 



Am 



3g~^ - 9g-3 - 27g"^ - 



(8) 



2.2. Metastability 

The problem of metastabihty in the infinite size g-state Potts model (for 
q > 4) is an old standing problem in statistical mechanics 16|, |3l|, |2J, |32 



18l . |20|. It has also kept the attention of the Quantum Chromo dynamics' 
(QCD) community for many years 33|, |3J, |3l|, [l9|, |35|, because it has some 
characteristics in common with the deconfining (temperature driven) phase 
transition in heavy quarks. 

Metastability is a verified fact in a finite system. It is known 24, 
18| that below but close to Tc the system quickly relaxes to a disordered 
(paramagnetic) metastable state, with a life time that diverges as the quench 
temperature T approaches Tc (see for example Fig. 4 in Ref. 20[). This state 
is indistinguishable from one in equilibrium in the sense of local dynamics, 
namely, two times correlations de pen ds only on the difference of times, while 
one time averages are stationary 
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Nevertheless, the existence of metastability in the thermodynamic limit 



is still an open problem [18|. In Ref. |16| Binder studied static and dynamic 
critical behavior of the model ([T]) for q = 3,4,5,6. Using standard Monte 
Carlo procedures he obtained good agreement with exact results for energy 
and free energy at the critical point and critical exponents estimates for 
g = 3 in agreement with high-temperature series extrapolations and real 
space renormalizat ion-group methods. When analyzing the g = 5 and 6 cases 
he realized that the transition is, in fact, a very weak first order transition, 
where pronounced "pseudocritical" phenomena occur. He studied system 
sizes from = 16 x 16 up to = 200 x 200, and observation times up 
to 10^ MC;^ (a Monte Carlo Step MCS is defined complete cycle of 

A^ spin update trials, according to the Metropolis algorithm). Within his 
analysis he was unable to distinguish between two different scenarios for the 
transition at g > 5 due to finite size effects taking place at the simulations. He 
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proposed two self- avoiding possible scenarios for the transition. In the first 
one the energy per spin reaches the transition temperature with a finite slope 
both coming from higher and lower temperatures, thus projecting metastable 
branches at both sides of the transition that end at temperatures and 
both different from Tc. In the second scenario, the energy reaches Tc with 
an infinite slope which would imply a first order phase transition with a true 
divergence of the specific heat at Tc. 

On the other hand, other approaches based on different definitions of 
the spinodal temperatures predict, either the convergence of the finite size 
spinodal temperatures to Tc 19 1 or a convergence to limit values different 



from but closely located to Tc [20 



3. Optimized GPU-based Monte Carlo algorithm for the q-state 
Potts model 

We developed a GPU based code to simulate the two dimensional Potts 
model, using classical Metropolis dynamics on square lattices of size N = Lx 
L sites with periodic boundary conditions. For the spin update we partition 
lattice sites in two sets, the whites and the blacks, laid out in a framed 
checkerboard pattern in order to update in a completely asynchronous way 
all the white cells first and then all the black ones (given that the interactions 
are confined to nearest neighbors). This technique is also know as the Red- 



Black Gauss-Seidel [36|. We analyzed equilibrium states of systems ranging 
from = 16 X 16 to = 32768 x 32768 {2^^ x 2^^ ~ 1.073 x 10^ spins). 

The typical simulation protocol is the following. Starting from an initial 
ordered state (s, = 1 Vz) we fix the temperature to T = T^m and run ttran 
to attain equihbrium, then we run tmax taking one measure each St steps to 
perform averages. After that, we keep the last configuration of the system 
and use it as the initial state for the next temperature, T = Tmm + <5T. This 
process is repeated until some maximum temperature T^ax is reached. We 
repeat the whole loop for several samples to average over different realizations 
of the thermal noise. In a similar way we perform equilibrium measurements 
going from T^ax to T^in starting initially from a completely random state. 

3.1. GPU: device architecture and CUD A programming generalities 

In 2006, NVIDIA decided to take a new route in GPU design and launched 
the G80 graphics processing unit, deviating from the standard pipeline de- 
sign of previous generations and transforming the GPU in an almost general 
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purpose computing unit. Although this decision could have been driven by 
the gaming community asking for more frames per second, NVIDIA took ad- 
vantage of his General Purpose Graphics Processing Units (GPGPU), and in 
2007 they launched the CUDA SDK, a software development kit tailored to 
program its G80 using C language plus minor extensions. The G80 hardware 
and the CUDA compiler quickly proved to have an extremely good relation 
in terms of GFLOPS per watt and GFLOPS per dollar with respect to the 
CPU alternatives in the application field of numerical algorithms. 

The architecture has evolved two generations, GT200 in 2008 and 2009, 
and the GFIOO in 2010, also known as the Fermi architecture. All of them 
share the same Single Instruction Multiple Thread (SIMT) concurrency paradigm 
in order to exploit the high parallelism (up to 480 computing cores) and the 
high memory bandwidth (up to 177GBps). The SIMT model is a convenient 
abstraction that lies in the middle of the SIMD (Single Instruction Multi- 
ple Data) and MIMD (Multiple Instruction Multiple Data), where the first 
reigned in the 80's with the vector computers, and the later is the com- 
monplace of almost every computing device nowadays, from cellphones to 
supercomputers. 

Using SIMT paradigm, the parallel algorithm development changes greatly 
since it is possible to code in a one-thread-per-cell fashion. The thread cre- 
ation, switching and destruction have such a low performance impact that 
doing a matrix scaling reduces to launch one kernel per matrix cell, even if the 
matrix is 32768 x 32768 of single precision floating point numbers summing 
up 1 GThread all proceeding in parallel. In fact, for the implementation, 
the more threads the better, since the high memory latency to global mem- 
ory (in the order of 200 cycles) is hidden by swapping out warps (vectors of 
32 threads that execute synchronously) waiting for the memory to become 
available. 

It is important to emphasize the role of blocks in the SIMT model. 
Threads are divided into blocks, where each block of threads have two special 
features: a private shared memory and the ability to barrier synchronize. Us- 
ing these capabilities, the shared memory can be used as a manually-managed 
cache that in many cases greatly improves the performance. 

We used the GTX 280, GTX 470 and GTX 480 boards. The relevant 
hardware parameters for these boards are shown in table [TJ 

The improvements of the Fermi architecture lies on the new computing 
capabilities (improved Instruction Set Architecture - ISA), the doubling of 
cores, the inclusion of LI and L2 cache, increased per-block amount of par- 
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Board Model 


GTX 280 


GTX 470 


GTX 480 


Available 


Q2 2008 


Ql 2010 


GPU 


GT200 


GFIOO 


CUD A capability 


1.3 


2.0 


CUDA cores 


240 


448 


480 


Processor Clock 


1.30GHz 


1.22GHz 


1.40GHz 


Global Memory 


1GB 


1.25GB 


1.50GB 


Memory Bandwidth 


141.7GBps 


133.9GBps 


177.4GBps 


LI Cache 


N/A 


16KB-48KB 


L2 Cache 


N/A 


768KB 


Max # of Threads per Block 


512 


1024 


Shared Memory per Block 


16KB 


48KB-16KB 


Max ^ of Registers per Block 


16384 


32768 



Table 1: Key features about NVIDIA GTX 280, GTX 470, and GTX 480 graphic cards. 



allelism and shared memory. 

As every modern computing architecture the memory wall effect has to 
be relieved with a hierarchy of memories that become faster, more expensive 
and smaller at the top. The bottom level is the global memory, accessible by 
every core and having from 1GB to 1.5GB of sizqj and a latency of 200 cycles. 
The next level is the shared memory, that is configurable 16KB or 48KB per 
block having a latency of only 2 cycles. At the top there are 32K registers per 
block. There are also texture and constant memory, having special addressing 
capabilities, but they do not bring any performance improvement in our 
application. The Fermi architecture has also incorporated EGG memory 
support to eventually deal with internal data corruption. 

The programming side of this architecture is a "G for GUDA" , an exten- 
sion of the G Programming Language js^] that enables the host processor to 



launch device kernels (38|. A kernel is a (usually small) piece of code that is 
compiled by nvcc, the NVIDIA GUDA Gompiler, to the PTX assembler that 
the architecture is able to execute. The kernel is executed simultaneously by 
many threads, organized in a two-level hierarchic set of parallel instances 



^This values apply to consumer graphics cards. The Tesla HFC line incorporates up 
to 6GB of memory (e.g. Tesla C2070), that is configurable to be ECC in order to improve 
reliability 
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indexed as {grid, block) (a grid of thread blocks). Internally each grid and 
block can be divided up to two dimensions for the grid and three dimensions 
for the block, in order to establish a simple thread-to-data mapping. Special 
variables store the thread position information of block and thread identifier 
{bid, tid) that distinguishes the threads executing the kernel. 

It is interesting to note that although the unit of synchronous execution is 
a warp of 32 threads, the threads inside a warp may diverge in their execution 
paths (occurrence of bifurcations), at the cost of having to re-execute the 
warp once for each choice taken. Needless to say that in general this impacts 
negatively in the performance and has to be avoided. 

The present code is divided in two main functions: spin update and 
energy and magnetization computation. The first function is implemented 
in host code by the function update and this comprises calling the device 
kernel updateCUDA once updating white cells and next updating black cells 
in a checkerboard scheme. The energy and magnetization (and their re- 
lated moments) summarization is done by calculate that calls the kernel 
calculateCUDA and two more auxiliary kernels: sumupECUDA and sumupMCUDA. 

3.2. Random Number Generator 

The Potts model simulation requires a great amount of random numbers. 
Namely, each cell updating its spin needs one integer random number in 
{0, . . . , g— 1} and possibly a second one in the real range [0, 1) to decide the 
acceptance of the flip. Hence, a key factor to performance is using a good 
parallel random number generator. 

Given the great dependence in terms of time (it has to be fast) and 
space (small number of per-thread variables), we find Multiply- With-Carry 
(MWC) [i^ ideal in both aspects. Its state is only 64 bits, and obtaining 
a new number amounts to compute Xn+i = {xn x a -f c„) mod h, where a 
is the multiplier, h is the base, and c„ is the carry from previous modulus 



operation. We took the implementation from the CUDAMCML package [40 
that fixes h = 2^^ in order to use bit masks for modulus computation. 

For independent random number sequences, MWC uses different multi- 
pliers, and they have to be good in the following sense: a x b — 1 should 
be a safeprime, where p is a safeprime if both p and {p — l)/2 are primes. 
Having fixed b = 2^^, the process of obtaining safe primes boils down to test 
for primality of two numbers goodmult{a) = prime{a x 2'^^ — 1) A prime{{a x 
232 _ 2)/2). It is important to remark that the nearer to 2^"^ is a the longer 
the period of the MWC (for a close to its maximum, the period is near to 
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2^^), therefore it is always advisable to start looking for goodmult down from 

We limit the amount of independent random number generators (RNG) to 
51272 = 131072 that is slightly lower than the 150000 good muhipliers that 
CUDAMCML gives in its file safe_primes_base32.txt. The state needed 
comprises 12 bytes per independent RNG, totalizing 1.5MB of global memory, 
less that 0.15% of the total available in the GTX 280. We consider this 
a good trade-off between independence in number generation and memory 
consumption. This design decision is crucial in the parallelization of the spin 
update function, as we frame the lattice in rectangles of 512 x 512, to give 
each thread an independent RNG0. Moreover, this implies that the larger the 
lattice, the more work will be done by a single thread. 

It is important to remark we are well below the RNG cycle even for the 
largest simulations. 

3.3. Spin update 

On top of the checkerboard division we have first to frame the lattice in 
rectangles of 512x512 in order to use the limited amount of independent RNG 
(Fig{Tl left). This implies launching two consecutive kernels (black/white) 
of 512 X 512/2 threads, typically organized into a grid of 32 x 16 blocks 
of 16 X 16 threads. The second step comprises the remapping of a two 
dimensional stencil of four points in order to save memory transfers. The 
row-column pair {i,j) is mapped to {{{i+j)mod 2 x L + i)/2,j), and this 
allows to pack all white and all black cells in contiguous memory locations 
improving locality and allowing wider reads of 3 consecutive bytes (FiglH 
right). 

We encode each spin in a byte, allowing simulations with q < 256 and 
< available RAM. Since some extra space is needed for the RNG state 
and for energy and magnetization summarization, this upper bound is not 
reached. The biggest simulations we achieve is L = 32768, g = 45 for the 
GTX 480. 

It is important to remark that shared memory is not used, since we could 
not improve performance and it hindered readability of the code. Texture 
memory techniques were not used for the same reasons. 



•^For system sizes smaller than N — 512^ we use smaller frames, and then, fewer RNG. 
But 512 X 512 is the standard framing choice for most of the work. 
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Figure 1: On the left: a 8 x 8 checkerboard framed in 4 x 4 (red marking), the cells 
updated by thread to are singled out, we also marked the north, east, south and west 
neighbors of cell •. On the right: packed checkerboard showing first half of whites, where 
the neighboring cells n,e,s,w are marked, also in the second half of black cells • is singled 
out. 



3.4- Computation of Energy and Magnetization 

During the evolution of the system we extract periodically two quantities: 
energy Eq.([T]) and magnetization Eq.([7]). The kernel responsible for this job 
is calculateCUDA. It first partitions the cells into CUDA blocks. In each 
block we have easy access to barrier synchronization and shared memory 
among its threads. Each block within its cells adds the local energies and 
accumulates in a partial vector (ni,n2, . . . ,nq) the number of spins in each 
state. This is performed in shared memory using atomic increments to avoid 
race conditions. After that, those blocks' results are added up in parallel 



using a butterfiy-like algorithm (38| by kernels sumupECUDA and sumupMCUDA, 



but none of the known optimizations [4l[ are applied, since it implies obfus- 



cating the code for a marginal global speedup. Previous kernels end up with 
up to approximately a thousand partial energies and vectors of spin counters, 
that are finally added in the CPU. 

It has to be noticed that device memory consumption in this part is linear 
not only in A^, but also in q. 

3.5. Defensive Programming Techniques and Code Availability 

Writing scientific code that is maintainable, robust and repeatable is of 
utmost importance for the fields of science where computer simulation and 



experimentation is an everyday practice [42 . 

CUDA coding in particular is hard, not only in creating the algorithms, 
choosing a good block division and trying to take advantage of all its capabil- 
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ities, but also, in the debugging and maintenance cycle. Debugging tools are 
evolving rapidly, for example there is a memory debugger cuda-memcheck 
that is shipped with current CUDA SDK. Nevertheless, we would rather ad- 
here to some passive and active security measures within our code to make 
it easier to understand and modify, and at the same time, to make it robust 
in the sense of no unexpected hangs, miscalculations or silent fails. 

Among passive security measures, we use assertions (boolean predicates) 
related to hardware limitations like the maximum of 512 threads per block. 
Other use of the assertions is checking for the integer representation limita- 
tions: given the computing power that GPGPU brings, lattices of 32768 x 
32768 are feasible to simulate, and integer overflow could be a possibility, 
for example when computing the total energy. Assertions were also used to 
enforce preconditions on algorithm running, for example, the spin updating 
cannot do well if L is not multiple of the frame size. We also check ev- 
ery return condition of CUDA library calls and kernels, in order to lessen 
the asynchrony of error detection in CUDA. The same practice is used in 
standard library calls for file handling. 

Active security measures are also taken. We use tight types in order to 
detect problems in compile time. We also decorate parameters and variable 
names with const modifiers where applicable. For pointer immutable pa- 
rameters we forbid the modification of pointed data as well as the pointer 
itself. The scope of automatic variables is as narrow as possible, declaring 
them within blocks, in order to decrease the namespace size in every line of 
code. We put in practice the simple but effective idea of using meaningful 
variable names in order to improve the readability. 



We also adhere to the practice of publishing the code 43| in the line 

of 0,0,8, 

since it benefits from community debugging and development. It 
can be found on 44 . 



4. Algorithm checking 

In order to validate our CUDA code we run some typical simulations to 
measure well established results. 

First we calculate the energy per spin e and magnetization m above and 
below the transition temperature, by cooling (heating) from an initially dis- 
ordered (ordered) state. The behaviors of e and m as functions of T for 
different values of q are shown in Figj2j From these calculations we obtain 
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0.2 0.4 0.6 0.8 



Figure 2: (Color online) Equilibrium energy per spin e and magnetization m (inset) versus 
temperature for q = 9, 12, 15, 96. Exact values at the transition point from equations ([3]), 
([6|) and ^ are marked as crosses. Data comes from averages over 10 samples of linear 
system size L = 2048. Error bars are smaller than the symbol size. 



the values of the energy (e^ and Co) and magnetization jump Am at the ex- 
act transition temperature (see Section [2]). Results are compared with exact 
values in table |2] 

We can see a very good agreement between data and exact results. It's 
worth noting that the data from table [2] is not the result of extrapolations of 
some finite size analysis, but the values from curves in Figj2]at the transition 
itself. Since we measure one point each AT in temperature, cooling and 
heating procedures won't necessary lead to a point measured exactly at Tc. 
So, we have to interpolate points close to Tc to deduce the corresponding 
values of Co, and rrio at Tc. The differences obtained from interpolations 
using points separated by AT and points separated by 2 x AT determine the 
estimated errors. 



We also calculate the fourth order cumulant of the energy 45|, |46 



as a function of the temperature for q = 6 and different system sizes. As it is 
well known, Vl is almost constant far away from the transition temperature 
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q 


-Co 


-ed 


Am 


exact 


calculated 


exact 


calculated 


exact 


calculated 


6 


1.508980... 


1.51(2) 


1.307516... 


1.306(1) 


0.677083... 


0.674(2) 


9 


1.633167... 


1.6332(5) 


1.033499... 


1.0334(5) 


0.834019... 


0.8338(4) 


15 


1.765905... 


1.7659(2) 


0.750492... 


0.7509(4) 


0.916693... 


0.9167(3) 


96 


1.960306... 


1.96030(3) 


0.243817... 


0.24382(4) 


0.989247... 


0.98924(2) 



Table 2: Comparison between calculated and known exact values of Bq, e^, and Am at the 
transition for different values of q. Results were obtained from averages over 10 samples of 
linear system size L = 2048 and equilibration and measurements times of at least 5 x 10^ 
MCS each one. 



and exhibits a minimum at a pseudo critical temperature 

t;(l) = t. + ^1M^J_ (10) 

In Figl3)D we show T*{L) vs. 1/L^ for g = 6. The extrapolated value of T*[L) 
for L ^ cx), 0.8078 ± 0.0002 agrees with the exact value = 0.8076068... 
within an accuracy of the 0.025%. 

Let us emphasize that, as it is well known, it's very difficult to get good 
measures of cumulants with a single spin flip MC algorithm. In order to 
get reliable averages of the cumulant minimum location, one should guaran- 
tee a measurement time long enough to let the system overcome the phase 
separating energy barrier back and forward several times. Moreover, the 
characteristic activation time to overcome the barrier increases both with q 
and L (it increases exponentially with L). For instance, simulation times of 
the order 10^ for each temperature are needed to obtain a good sampling for 
g = 6 and L = 256. 

In addition, we test our code for the q = 2 (Ising) case. FigJH shows the 
susceptibility of the order parameter calculated as 

X=^[{m')-{mf] (11) 

The extrapolated value of the pseudo critical temperature T*{L) (defined as 
the location of the susceptibility maximum) for L — t- oo, 1.1345 ± 0.0001, 
agrees with the exact valutl r,(g = 2) = 1.1345926... within an accuracy of 



It should be remembered that Jpotts = '^Jising if we compare our hamiltonian ([ij 
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Figure 3: (Color online) Finite size scaling of the fourth order cumulant for (7 = 6. (a) Vl 
as a function of temperature for different system sizes. Averages were taken over several 
samples ranging from 300 to 400 for small system sizes down to 50 and 20 for L — 128 
and L ~ 256. The orange line indicate the analytically predicted location of the minimum 
in the thermodynamic limit, (b) Pseudo critical temperature T* vs. Error bars, 

estimated from the uncertainty when locating the minimum of Vl, are shown only when 
larger than the symbol size. 



the 0.009%. Even more, if we plot the maximum value of x against the linear 
size L it is expected to observe a finite size scaling of the form Xmax ~ L'^^'^ 



47l |. where 7 and u are the exactly known critical exponents for the 2D Ising 
model. We obtain such scaling with a combined exponent ^jv = 1.77 ±0.02, 
in a good agreement with the exact value 7/^^ = = 1.75. 



5. Algorithm performance 

The first step towards performance analysis is the kernel function calling 
breakdown. In this case, it is done using CUDA profiling capabilities and 
some scripting to analyze a 2.9GB cuda_profile_0.log file produced after 
12, 6 hours of computation. The parameters used for this profiling are q = 
9, A = 2048 X 2048, T^,„ = 0.721200, T„,, = 0.721347, = 10"^, ttran = 
lO^MCS, tmax = lO^MCS and 6t = 500MCS. 

The profile shows that there are approximately 32 millions of calls to 
updateCUDA and just a few thousands to the other three kernels. Since the 
individual gpu time consumptions of each kernel are comparable, the only 



with the usual Ising hamiltonian, thus giving a Tc{q = 2) which is a half of the commonly 
appearing in Ising model works. 
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T 



Figure 4: (Color online) Finite size scaling of the susceptibility for q — 2. (main plot) 
X as a function of temperature for different system linear sizes. Averages were taken 
over several samples ranging from 300 for small system sizes down to 50 and 15 for L = 
1024 and L = 2048, respectively. We have used equally equilibration and measurement 
times of 2 X lO^MCS', measuring quantities each lOMCS, thus totalizing averages over 
6 X 10^ to 3 X 10^ as we increase the system size, (upper inset) Maximum value of 
the susceptibility peak Xmax ^'S. L. Error bars, estimated from the uncertainty when 
evaluating the maximum, are smaller than the symbol size, (lower inset) Pseudo critical 
temperature T*{L) vs. 1/L. Error bars, estimated from the uncertainty when locating 
the position of the maximum, are shown only when larger than the symbol size. 
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relevant kernel to analyze is updateCUDA. 

To analyze the kernel updateCUDA we sweep L in the range from 512 to 
32768 in powers of two, measuring the average execution time of the kernel 
and normalizing it to nanoseconds per spin flip. 

We compare the three GPUs, using the same machine code (Compute 
Capability - CC 1.3, generated by NVCC 3.2)0, and the same video driver 
(driver version 260.19). We also compare the GPUs performance with a CPU 
implementation. For this version, we tried to keep the structure of the CUDA 
code, in order to compare the execution of the same physical protocol on each 
architecture. We replaced the calls to CUDA kernels with loops running 
over all the spins in the same checkerboard scheme, we used the same MWC 
random number generator. We also added some optimizations to improve the 
CPU performance like creating a precomputed table of Boltzmann weights 
for the spinfiip acceptance for each simulated temperature, since the CPU 
have no mechanism for hiding memory latency and the impact of any floating- 
point unit (FPU) computation is noticeable. We run the CPU code against 
a Core 2 Duo architecture (E8400 - Ql 2008) using GCC 4.4.5 with carefully 
chosen optimization fiag^. 

We also vary q in the set {6, 8, 9, 12, 15, 24, 48, 96, 192}. We don't find any 
significant variation of the performance with g, except in the q = 2^ cases 
for the GTX 280, where the compiler obtains slight performance advantages 
using bitwise operators for modulus operation. The Fermi board has an 
improved modulus, rendering that difference imperceptible. 

The profiling measurement is done in the GPU cases using CUDA pro- 
filing capabilities that gives very precise results, avoiding any code instru- 
mentation. For the CPU version it is necessary to instrument the code with 
simple system calls to obtain the wall time. In order to make the measure- 
ment independent of the temperature range covered, given that the transi- 
tion temperature (and therefore the flip acceptance rate) changes with g, we 
choose a deterministic write, i.e. we always write the spin value irrespective 
if the spin changes its state respect of its previous state or not. Writing 
the spin value only when it changes its state, brings a slight performance 
improvement around 2% in the general case. 

In figure [5] we can see that the curve corresponding to the CPU imple- 



^Using CC 2.0 ISA does not bring any performance improvement. 
^Compiler options -03 -ff ast-math -march=native -f unroll-loops. 
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Figure 5; Spin flip time in nanoseconds vs. lattice size running on an Intel Core 2 Duo 
E8400@3.0GHz CPU, and running on GTX 280, GTX 470 and GTX 480 NVIDIA CPUs. 
Averages are performed over 400 runs for the CPUs and 60 runs for the CPU. Error bars 
are smaller than symbol sizes when not showed. 



mentation is flat arouncl^ 22.8ns, showing no dependence of tlie averaged spin 
flip time with system size. For GPU cases, instead, we do have variations 
with respect to L. The slowest card is the GTX 280, with spin flip times 
in the range [0.48ns, 0.54ns] which are 47x to 42x faster than those of the 
CPU code. The GTX 470 has a variation between 0.21ns and 0.30ns, giving 
a speedup between 108x and 76x. The fastest card is the GTX 480 with spin 
flip times in [0.18ns, 0.24ns] achieving a speedup from 126x to 95x. There 
is also another curve corresponding to a specifically tuned version for the 
GTX 480 carci3 and CC 2.0, obtaining 155x (0.147ns) for the fastest case. 



^ It's worth mentioning that in order to compare this value with CPU implementations 
of the Ising model (e.g., 8ns in [13]), one should take into account that the Potts model 
update routine requires an extra random number to choose where to flip the spin. In 
addition, using MWC doesn't provide the fastest execution times; other RNGs as LCG-32 
give better times but not completely reliable results due to their short period. For 
the sake of completeness, we report that eliminating one random number toss and using 
LCG-32 instead of MWC we obtain a spin flip time of 14.5ns for our CPU implementation. 

^Each block is filling the maximum 1024 threads, we also disable LI cache for a 
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It is important to notice that even when using newer CPU architectures hke 
Nehalem (X5550 - Ql 2009) the spin flip time only drops 2ns in the best 
case respect to the Core 2 Duo, and that Intel C++ Compiler (ICC) cannot 
do any better than that. 

Nevertheless, it should be noted that better CPU implementations could 
be possible, since most appropriate implementations for each architecture 
could be quite different from each other. For example, lower times can be 
attained for CPU using typewriter update scheme instead of a checkerboard 
one. For that reason, we hold the idea that a good measure to compare 
performances between GPU implementations is the "time per spin flip" , and 
the speedup respect to a CPU implementation is just additional illustrative 
information. 

The variations for the GPU cards are due to two competing factors in the 
loop of the update kernel. One is strictly decreasing with L and is related to 
the amount of global memory movements per cell. Since there is one RNG 
for each thread, the global memory for the RNG state is retrieved one time 
in the beginning and stored in the end, therefore the larger the L, this single 
load/store global memory latency is distributed into more cells. The second 
factor is increasing in L and is given by the inherent overhead incurred by 
a loop (comparison and branching), that for L = 32768 amounts to 4096 
repetitions. 

We also frame at 256 x 256 and 1024 x 1024, obtaining a 25% of perfor- 
mance penalty for the former, and a performance increase of 2% in the later. 
This gives us more evidence that the framing at 512 x 512 is an appropriate 
trade-off between memory consumption by the RNG and the speed of the 
code. 

Although there are divergent branches inside the code, even for deter- 
ministic cell writes (the boolean "or" operator semantics is shortcircuted), 
eliminating all divergent branches doing an arithmetic transformation does 
not bring any performance improvement. This shows the dominance of mem- 
ory requests over the integer and floating point operations, and the ability of 
the hardware scheduler in hiding the divergent branch performance penalty 
in between the memory operations. 

To our knowledge this is the first time the Potts model is implemented 



(free) slight performance improvement: compiler options -Xptxas -dlcm=cg -Xptxas 
-dlcm=cg. 
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in GPUs, so there is no direct performance comparison. There exist pre- 
vious works that deal with similar problems and that report performance 
measurements. Preis et. al [sj implemented a 2D Ising model in GPUs, they 
reported a speedup of 60x upon their CPU implementation using a GTX 280. 
Their implementation has the disadvantage that the system size is limited 
by the maximum number of threads per block allowed (enforcing L < 1024 
on GT200 and L < 2048 on GFIOO). Later on, Block, Virnau and Preis Q 
simulated the 2D Ising model using multi-spin coding techniques obtaining 
0.126ns per spin flip in a GT200 architecture. Weigel 0, llo| has also con- 



on 



sidered the 2D Ising model, obtaining a better 0.076ns per spin flip [48 
the same architecture, which is improved to 0.034ns per spin flip on a Fermi 
(GFIOO) architecture. Moreover, this was obtained with a single-spin coded 
implementation; however this gain is partially due to the use of a multi-hit 
technique updating up to /c = 100 times a set of cells while others remain 
untouched. Notwithstanding, Weigel obtains l^l 0.13ns per spin flip for the 
update without multi-hit and multi-spin, which is comparable with the result 
of the multi-spin coded version in [6|. Performance results on the 3D Ising 
model are also available [s], [lo[ . The Heisenberg spin glass model is simulated 
on a GPU in Ref. 0], and for this floating point vector spin, they achieve a 
0.63ns per spin flip update on a GFIOO architecture. Implementations of 
the Heisenberg model are also reported in 10| with times per spin flip down 
to 0.18ns on a Fermi architecture, representing impressive speedups (up to 
1029x). Recently, a GPU parallelization for the GF200 architecture was im- 



plemented in the Cellular Potts Model [49| with ~ 80a; speedup respect to 
serial implementations. 

We also conduct end-to-end benchmarks of a small simulation (g = 9, 
L = 1024, # of samples=3, T™„ = 0.71, T„,, = 0.73, 5T = 0.002, tf,,, = 2000, 
imax = 8000, (5t = 50). We obtain 193s for the GTX 280 and 8115s for the Intel 
Core 2 architecture, with a global speedup of 42x, very similar to the speedup 
reported by the microbenchmarks. The coincidence of microbenchmarks and 
end-to-end benchmarks results reaffirms the fact that all the optimization 
efforts should go to the update kernel updateCUDA. 



6. Metastability in the q-state Potts model 

Based on Binder's criterion described in Section[2]we analyze the existence 
of metastability for g > 4 as the system size increases. From Figl2] we see 
that for large enough values of q the energy branches attain the transition 
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temperature from both sides with a finite slope, even with a relatively poor 
temperature resolution. As q decreases, a closer approach to Tc is needed in 
order to distinguish whether a true singularity at Tc is present or n ot, since 



the spinodal temperatures are expected to be located very close to |20[ Tc. 

A power law divergence of the specific heat at Tc would imply the following 
behavior 

eT<T. = eo-A'{l-T/T,)^"'- (12) 
eT>T. = e,-A+(l-Tc/T)i--+ (13) 

with q;_, a+ > 



On the other hand, if well defined metastable states occur, the energy 
could be represented in terms of a specific heat diverging at pseudospinodal 
temperatures T+, T^~ 

eT<T. = e- -A-(l-T/T+)i--- (14) 
eT>T. = e%-A+{l-TjTY~-^ (15) 

If divergences for the specific heat occur at the pseudospinodals, we should 
see exponents a_ = a+ ~ in Eqs. f|T2|) and f lT3|) . since Eqs. ffT^ and f lTSj) 
imply finite slopes at Tc. 



We measure equilibrium curves for eT<Tc {gt>tJ starting from a ordered 
(disordered) initial state and performing a cooling (heating) procedure ap- 
proaching Tc, as described in section [31 The results are presented in Fig|6] 
and [71 In both figures a crossover of the curve's slope as we approach Tc can 
be observed for all values of q. Close enough to Tc, the curves for q = 9, 15, 96 
show exponents which are indistinguishable from 1, consistently with the ex- 
istence of metastability and divergences at spinodal temperatures different 
from Tc, at least for q >9. 



As pointed out by Binder 16|, to observe the crossover (if it exists at all) 
a temperature resolution at least AT = Tc — for the high energy branch 
(or AT = T^ — Tc for the low energy branch) is needed, where AT = |T — Tc|. 
A numerical estimation of the lower spinodal temperature predicted by Short 



Time Dynamics [20| is given by 



^^—^ - 0.0007 (ln(l + g-4))'-'\ (16) 
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Figure 6: (Color online) Log-log plot of energy differences versus temperatures T > Tc 
for various q. Data correspond to averages over 20 samples of systems size L = 2048, 
equilibration times ranging from 5 X 10^[MCS] to 2 x 10^ [MCS"] and measurement times 
of 5 X lO'' [MCS"], with sampling every 100 [MCS*]. Error bars were estimated considering 
a 90% confidence interval (only some representative error bars are shown for clarity). Full 
color lines arc power- law fits of the form \ {e — ed)/ed\ = A{l — Tc/T)°- (resulting exponents 
a are showed in the labels). Dashed vertical lines of different colors indicate correspond 
toT = Te + AT{q), with AT = - T" and T" from Eq.dlll). The inset shows q = 9 
curves for different system sizes, the full orange curve indicates the slope 1. 
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Figure 7: (Color online) Log-log plot of energy differences versus temperatures T < Tc 
for various q. Data correspond to averages over 20 samples of systems size L = 2048, 
equilibration times ranging from 5 X lO'' [MCS'] to 2 x 10^ [MC5] and measurement times 
of 5 X 10^[AICS], with sampling every 100 [MCS*]. Error bars were estimated considering 
a 90% confidence interval (only some representative error bars are shown for clarity). Full 
color lines are power-law fits of the form (e — eo)/eo = A{1 — T/Tc)"" (resulting exponents 
a are showed in the labels). 
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The vertical dashed hnes in FigE] correspond to T = Tc + AT{q), as predicted 
from Eq.f ll6p according to the previous criterion. The coincidence with the 
crossover points for all values of q shows a complete agreement between the 
present results and those from Short Time Dynamics calculations. To attain 
the desired temperature resolution the system size has to be large enough, 



since finite size rounding errors are expected to decay as 1 /L [16|, |45| . This is 
illustrated in the inset of Fig E] for the particular case q = 9, where a strong 
finite size effect is observed for L = 128. A rough estimation of the minimum 
size required to reduce the error L ^ 1/AT predicts L = 400. We see that 
this finite size effect is suppressed for sizes L > 1000. Moreover, further 
increase of the system size does not change the behavior of the curves close 
to Tc. 

We have no estimations for for arbitrary values of q, but a close look 
to the curves in Fig|2] suggest that T+ is closer to Tc than is. This is 
consistent with the behavior observed in FigJTJ where crossovers occur closer 
to Tc than in Figj6l 

Our results for q = 6 are not conclusive. For instance, in the high energy 
branch we observe the previously discussed crossover, but the slope changes 
from 0.6 to 0.8. Such variation is of the same order of the fitting error below 
the crossover. This is because statistical fluctuations in the energy become 
very important at the required temperature resolution level {AT/T^ < 10~^), 
as can be seen in FigEl Hence, to obtain a clear answer a very large sample 
size (one can roughly estimate ~ 2000) and probably a larger system size is 
needed. In fact, we performed simulations with a sample size 50 (for L = 
2048), without any improvement in the results. We even simulate systems of 
L = 8192 with a sample size on the order of 10 with no appreciable change. 

The situation is more difficult for the low energy branch, where no clear 
evidence of crossover is observed (see FigJT]). However, one could expect the 
existence of an upper spinodal temperature T+ located closer to Tc than the 
lower one and therefore a higher temperature resolution (together with 
larger system and sampling sizes) would be needed to elucidate whether there 
is metastability or not. 

7. Discussion 

We implemented a CUDA-based parallel Monte Carlo algorithm to sim- 
ulate the Statistical Mechanics of the q-state Potts model. The code allows 
a speedup (compared with an optimized serial code running on a CPU) from 
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42x in the GTX 280 card up to 155x in a GTX 480, with an average time per 
spin flip of 0.54ns down to 0.147ns respectively. Those times are of the same 
order of previous implementations in the simpler case of the Ising model, 
without the usage of sophisticated programming techniques, such as multi 
spin coding. Besides the speedup, the present algorithm allows the simula- 
tion of very large systems in very short times, namely ~ 10^ spins with an 
average time per MCS of 0.15s. Such performance is almost independent of 
the value of q. The key factors to achieve those numbers is the per-thread 
independent RNG that is fast and takes only a few registers, the framing 
scheme that increases the amount of computation done by each thread and 
at the same time it bounds the number of independent RNG needed, and 
finally the cell-packing mapping that orders the memory access. 

The possibility of performing high speed simulations at large enough sys- 
tem sizes allowed us to study the metastability problem in the two dimen- 
sional system based on Binder's criterion, namely, on the existence or not of 
specific heat singularities at spinodal temperatures different from the transi- 
tion one (but very close to). Our results provide a positive numerical evidence 
about the existence of metastability on very large systems, at least for g > 9. 

Even when our results for g = 6 suggest the same behavior as for larger 
values of g, they could also be consistent with the absence of metastability. 
Hence, one cannot exclude the existence of a second critical value 4 < g* < 9 
such that metastability disappears when 4 < g < g*. 

Although the present implementation was done for a two dimensional 
system with nearest neighbors interactions (checkerboard update scheme), 
its generalization to three dimensional systems and/or longer ranged inter- 
actions is feasible, but some features should be adjusted. For the general- 
ization to the 3D case, the checkerboard scheme defining two independent 
sub-networks persists, however the cell-packing scheme should be updated 
conveniently. For the 2D case with first and second neighbors interactions, 
there are nine independent sub-networks to update instead of two. The com- 
bination of both generalizations is direct. 

The present implementation is based on the simplest single-spin flip al- 
gorithm namely. Metropolis. Its extension to more sophisticated single spin 



flip algorithms (See for example Refs.[50|, [5l[) is also straightforward and 



represent an interesting prospective in the field. In particular, temperature 
reweighting 5^ or other histogram based techniques (see for example 47| ) 
can be implemented by keeping track of the energy changes at each spin flip 
for each step, instead of making the calculation of the energy over the whole 
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system at each step. This kind of tracking could be done without loose of 
performance by implementing a paralell acumulation of local energy changes 
on-the-fly taking advantage of the GPU's hierarchic memory scheme. 

Besides its theoretical interest, the large-g Potts model (or minor varia- 
tions of it) is widely used for simulating the dynamics of a large _ variety of 
systems, such as soap bubbles and foam 
segregation js^], biological c ells 
neural networks 



53l . |54| | , grain growth [55|, |56| , gene 



58| . tumor migration [59[, image segmen- 



tation [60J , neural networks [61| and social demographics behavior [62|, |63 
The present implementation of the Potts model on GPUs, or easy modifi- 
cations of it, would result helpful for some of the above cited applications. 
The possibility of simulating bigger systems and having results faster than 
usual should be welcomed in the statistical physics community. Our CUDA 
code is available for download and use under GNU GPL 3.0 at our Group 
webpage j44 . 
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